Data Quality in Imitation Learning
In supervised learning, the question of data quality and curation has been sidelined in recent years in favor of increasingly powerful and expressive models that can ingest internet-scale data. However, in offline learning for robotics, we simply lack internet-scale data, so high-quality datasets are a necessity. This is especially true in imitation learning (IL), a sample-efficient paradigm for robot learning from expert demonstrations. Policies learned through IL suffer from state distribution shift at test time due to compounding errors in action prediction, which leads to unseen states from which the policy cannot recover.

Instead of designing new algorithms to address distribution shift, an alternative perspective is to develop new ways of assessing and curating datasets. There is growing evidence that the same IL algorithms can have substantially different performance across different datasets. This calls for a formalism for defining metrics of data quality that can further be leveraged for data curation.

In this work, we take a first step toward formalizing data quality for imitation learning through the lens of distribution shift: a high-quality dataset encourages the policy to stay in distribution at test time. We propose two fundamental properties that are necessary for a high-quality dataset: (i) action divergence, the mismatch between the expert and learned policy at certain states; and (ii) transition diversity, the noise present in the system for a given state and action. We investigate the combined effect of these two key properties in imitation learning theoretically, and we empirically analyze models trained on a variety of data sources. We show that state diversity is not always beneficial, and we demonstrate how action divergence and transition diversity interact in practice.
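The interaction between action divergence and transition diversity can be illustrated with a toy one-dimensional rollout. This is a minimal sketch, not the paper's actual model: the dynamics, the 0.3 in-distribution threshold, and the `eps`/`sigma` parameters are all hypothetical choices standing in for action divergence and transition noise.

```python
import numpy as np

def rollout(eps, sigma, horizon=100, seed=0):
    """Toy 1-D rollout. The expert action a*(x) = -x keeps the state at 0.
    The cloned policy reproduces the expert up to a small action error eps
    on states it saw in training (|x| <= 0.3), but on unseen states it
    outputs no corrective term at all, so errors compound unchecked."""
    rng = np.random.default_rng(seed)
    x = 0.0
    for _ in range(horizon):
        if abs(x) <= 0.3:
            a = -x + eps          # in distribution: near-expert action
        else:
            a = eps               # out of distribution: no recovery
        x = x + a + sigma * rng.normal()  # sigma models transition noise
    return abs(x)

# With small action divergence and no transition noise, the policy
# settles at |x| = eps and stays in distribution.
in_dist = rollout(eps=0.05, sigma=0.0)

# The same action divergence combined with transition noise can push the
# state past the training distribution, after which the error is never
# corrected and drifts instead.
out_dist = rollout(eps=0.05, sigma=0.3)
```

In this sketch neither property is harmful alone; it is their combination that produces test-time distribution shift, which is the intuition the abstract formalizes.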
Top Big AI Trends and Challenges Impacting Media, Advertising & Entertainment Industry
I recently interviewed some of the top data science leaders from Comcast/Freewheel, Condé Nast, ViacomCBS, Audoir, USA Today Network, and Samba TV on the biggest trends, challenges, and opportunities they see for ML & AI in media, advertising, & entertainment -- and what the future may hold. What are some of the biggest trends you'll see being adopted by the entertainment and media industries? Christopher Whitely, Senior Director of Applied Analytics at Comcast/FreeWheel, shares "There are a few areas that we'll see adopted by M&E industries in the coming months and years, including more contextual advertising, where advertising creative assets are matched to appropriate program content algorithmically. Federated learning is also a new trend, which refers to modeling using machine learning without sharing data sets. Privacy is important, so I expect we'll see continued use of aggregated customer segments and clean rooms for marketing and analytics. Also, lookalike models will help advertisers reach potential customers and optimize campaigns for the greatest effect."
- Media (1.00)
- Information Technology > Security & Privacy (1.00)
How LinkedIn, Uber, Lyft, Airbnb and Netflix are Solving Data Management and Discovery for Machine Learning Solutions
When it comes to machine learning, data is certainly the new oil. The processes for managing the lifecycle of datasets are some of the most challenging elements of large-scale machine learning solutions. Data ingestion, indexing, search, annotation, and discovery are some of the aspects required to maintain high-quality datasets. The complexity of these challenges increases linearly with the size and number of the target datasets. While it is relatively easy to manage training datasets for a single machine learning model, scaling that process across thousands of datasets and hundreds of models can become nothing short of a nightmare. Some of the companies at the forefront of machine learning innovation, such as LinkedIn, Uber, Netflix, Airbnb, and Lyft, have certainly experienced the magnitude of this challenge, and they have built specific solutions to address it.
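The ingestion, indexing, search, and discovery aspects listed above can be sketched as a minimal in-memory dataset registry. This is an illustrative toy, not any of these companies' actual systems (which are distributed services with far richer metadata); all class and field names here are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    """One entry in a hypothetical registry: the metadata that ingestion,
    search, annotation, and discovery all operate on."""
    name: str
    owner: str
    schema: dict                                  # column name -> type
    tags: set = field(default_factory=set)
    annotations: dict = field(default_factory=dict)

class DatasetRegistry:
    """Toy registry: ingest dataset metadata, index it by tag, search it."""
    def __init__(self):
        self._records = {}    # name -> DatasetRecord
        self._by_tag = {}     # inverted index: tag -> set of dataset names

    def ingest(self, record):
        self._records[record.name] = record
        for tag in record.tags:
            self._by_tag.setdefault(tag, set()).add(record.name)

    def search(self, tag):
        """Discovery: list all dataset names carrying a given tag."""
        return sorted(self._by_tag.get(tag, set()))

registry = DatasetRegistry()
registry.ingest(DatasetRecord("rides_2023", "mobility-team",
                              {"ts": "timestamp", "fare": "float"},
                              tags={"rides", "training"}))
registry.ingest(DatasetRecord("sessions", "web-team",
                              {"user": "str"}, tags={"training"}))
print(registry.search("training"))
```

The inverted index is the key design choice: it makes discovery a lookup rather than a scan, which is what keeps search tractable as the number of datasets grows from one to thousands.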
- Education (1.00)
- Information Technology > Services (0.97)
- Transportation > Passenger (0.62)